The Tropical Cyclones Developmental Dataset was used to develop the Statistical Hurricane Intensity Prediction Scheme (SHIPS) for predicting changes in tropical cyclone (TC) intensity (DeMaria and Kaplan 1994). The National Hurricane Center (NHC) uses SHIPS, along with other models, to generate predictions and guide official track and intensity forecasts (“NHC Track and Intensity Models” 2009). SHIPS forecasts have outperformed climatology and persistence forecasts since circa 1997. However, SHIPS has not performed well in Rapid Intensification (RI) events, defined as an increase in maximum wind speed over 24 hours exceeding a threshold of 30, 35, or 40 knots, as described in Kaplan and DeMaria (2003).
Focusing on TCs from the Atlantic Basin only, the raw data arrive in a file called lsdiaga_1982_2014_rean_sat_nbc_ts.dat, which is described as Atlantic data with SHIPS predictors from either re-analysis or operational analysis, with satellite variables when available. The file is one concatenated list of record sets, one for each storm case. Here, a case includes a current time observation (hour 0) in addition to 120 hours of forecast information (hours 6 to 120) and, in some cases, 12 hours of past information (hours -12 to -6). The case time points occur at 6 hour intervals. Each case begins with a descriptor line called HEAD and ends with an end line called LAST. Not all predictors are available for all years. To put the .dat file into a readable format, we treated each record as a fixed width format (fwf) table, skipping the header rows, with 24 columns of width 5. Once the fwf table was parsed, it was transposed to place the attributes as columns and the time points as rows. The missing value string 9999 was replaced with NA, and the parsed, transposed, and cleaned record was written to a .csv file for the sake of creating a bank of easily accessible records. As there are multiple records for one TC, the .csv file names are written as uniquestormID.recordnumber.csv.
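The parsing steps described above can be sketched in Python using only the standard library. This is a minimal illustration and not the code used to build the data set: the function names `split_cases` and `parse_case` are hypothetical, and the sketch assumes that the HEAD and LAST tags are the last whitespace-delimited token on their lines, that the last of the 24 fields holds the attribute name, and that every other field is an integer or blank.

```python
from typing import Dict, List, Optional

MISSING = "9999"  # missing-value string named in the text


def split_cases(lines: List[str]) -> List[List[str]]:
    """Group raw lines into cases delimited by HEAD and LAST lines."""
    cases: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in lines:
        tokens = line.split()
        tag = tokens[-1] if tokens else ""
        if tag == "HEAD":
            current = []  # start a new case; the header row itself is skipped
        elif tag == "LAST":
            if current is not None:
                cases.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return cases


def parse_case(rows: List[str], n_cols: int = 24,
               width: int = 5) -> List[Dict[str, Optional[int]]]:
    """Parse fixed-width rows, then transpose to one dict per time point."""
    table: Dict[str, List[Optional[int]]] = {}
    for row in rows:
        padded = row.rstrip("\n").ljust(n_cols * width)
        fields = [padded[i * width:(i + 1) * width].strip()
                  for i in range(n_cols)]
        name, raw = fields[-1], fields[:-1]  # assumption: name is the last field
        table[name] = [None if v in ("", MISSING) else int(v) for v in raw]
    # Transpose: attributes become keys (columns), time points become rows.
    return [{name: values[t] for name, values in table.items()}
            for t in range(n_cols - 1)]
```

Each parsed case could then be written out with `csv.DictWriter`, one file per storm ID and record number.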
The next step in preprocessing the data was to concatenate the hour 0 observations for each case of each storm. Before adding an observation to a master data frame, we crosschecked observations of corresponding time points. If the cases are recorded correctly, then for a given storm, hours 0 to 114 of the current case should be equivalent to hours 6 to 120 of the previous case for time dependent predictors. To keep track of case discrepancies, we appended a column called Match to the master data frame. Below is a detailed description of each of the raw attributes of the master data frame, mostly adapted from the predictor description file.
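The crosscheck between consecutive cases can be sketched as follows. The function name `find_mismatches` and the data layout (each case as a dict mapping forecast hour to a dict of predictor values) are assumptions made for illustration, as is the premise that consecutive cases are exactly 6 hours apart.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Case = Dict[int, Dict[str, Optional[int]]]  # forecast hour -> predictor values


def find_mismatches(previous: Case, current: Case, step: int = 6,
                    last_hour: int = 120,
                    skip: Sequence[str] = ()) -> List[Tuple[int, str]]:
    """Compare hours 0..last_hour-step of `current` with hours
    step..last_hour of `previous`; return (hour, attribute) pairs that differ."""
    mismatches = []
    for hour in range(0, last_hour, step):  # 0, 6, ..., 114 by default
        now = current.get(hour, {})
        before = previous.get(hour + step, {})
        for name in sorted(set(now) | set(before)):
            if name in skip:
                continue  # e.g. predictors that are not time dependent
            if now.get(name) != before.get(name):
                mismatches.append((hour, name))
    return mismatches
```

Under these assumptions, a Match value for a case would be `not find_mismatches(previous, current)`.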
| Attribute Name | Description |
|---|---|
| ID | chr. Storm identifier. The first two characters “AL” represent the Atlantic basin, the second two characters represent the sequence number of a TC in a certain year, and the remaining four characters represent the year when the TC happened |
| DATE | POSIXct. Date of the storm in the format yymmdd |
| TIMESTAMP | POSIXct. Date of the storm in the format yymmdd HH |
| RECORDNM | num. Record number from the original .dat file. The time intervals are still being corrected |
| VMAX | int. Maximum surface wind (kts) |
| MSLP | int. Minimum sea level pressure (hPa) |
| TYPE | factor. Storm type (0 = wave, remnant low, or dissipating low; 1 = tropical; 2 = subtropical; NA = extratropical) |
| HIST20-HIST120 | int. Number of 6 hour periods that the storm max wind has been above 20, 25, … 120 kts. |
| DELV | int. Intensity change relative to the storm start (kts) |
| INCV | int. Intensity change relative to the previous 6 hour interval. Set to NA for land mass crossings |
| LAT | int. The latitude in 10*degrees North of the approximate storm center |
| LON | int. The longitude in 10*degrees West of the approximate storm center |
| CSST | int. Climatological sea surface temperature (deg C*10) |
| CD20 | int. Climatological depth (m) of 20 degree isotherm from 2005-2010 NCODA analyses |
| CD26 | int. Climatological depth (m) of 26 degree isotherm from 2005-2010 NCODA analyses |
| COHC | int. Climatological ocean heat content (kJ/cm^2) from 2005-2010 NCODA analyses |
| DTL | int. Distance to nearest major land mass (km) |
| RSST | int. Reynolds sea surface temperature (deg C*10) |
| PCHN | int. Estimated ocean heat content (kJ/cm^2) from COHC and current sea surface temperature anomaly. Designed to fill in missing RHCN |
| U200 | int. 200 hPa zonal wind speed (10*kts) for r = 200-800 km on average |
| U20C | int. 200 hPa zonal wind speed (10*kts) for r = 0-500 km on average |
| V20C | int. Meridional (v) component of 200 hPa wind speed (10*kts) for r = 200-800 km on average |
| E000 | int. 1000 hPa equivalent potential temperature, \(\theta_e\) for r = 200-800 km on average (K) |
| EPOS | int. Average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average (K) |
| ENEG | int. Negative average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, sign not included (K) |
| EPSS | int. Average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, with \(\theta_e\) compared with the saturated \(\theta_e\) of the environment (K) |
| ENSS | int. Negative average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, sign not included with \(\theta_e\) compared with the saturated \(\theta_e\) of the environment (K) |
| RHLO | int. 850-700 hPa relative humidity(%) for 200-800 km |
| RHMD | int. 700-500 hPa relative humidity(%) for 200-800 km |
| RHHI | int. 500-300 hPa relative humidity (%) for 200-800 km |
| PSLV | int. Pressure (hPa) of the center of mass of the layer where the storm motion best matches the environmental flow. Also used to calculate steering pressure |
| Z850 | int. 850 hPa vorticity (\(sec^{-1}*10^7\)) for r = 0-1000 km |
| D200 | int. 200 hPa divergence (\(sec^{-1}*10^7\)) for r = 0-1000 km |
| REFC | int. Relative eddy momentum flux convergence (\(m/s/day\)) for r = 100-600 km on average |
| PEFC | int. Planetary eddy momentum flux convergence (\(m/s/day\)) for r = 100-600 km on average |
| T000 | int. 1000 hPa temperature (deg C*10) 200-800 km average |
| R000 | int. 1000 hPa relative humidity 200-800 km average |
| Z000 | int. 1000 hPa height deviation (m) from the U.S. standard atmosphere |
| TLAT | int. Latitude of 850 hPa vortex center in NCEP analysis (10*deg N) |
| TLON | int. Longitude of 850 hPa vortex center in NCEP analysis (10*deg W) |
| TWAC | int. Symmetric tangential wind at 850 hPa from NCEP analysis 0-600 km average (\(m/sec*10\)) |
| TWXC | int. Maximum 850 hPa symmetric tangential wind at 850 hPa from NCEP analysis (\(m/sec*10\)) |
| G150 | int. Temperature perturbation at 150 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km center on input lat and lon (not always the model/analysis vortex position) (deg C*10) |
| G200 | int. Temperature perturbation at 200 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km center on input lat and lon (not always the model/analysis vortex position) (deg C*10) |
| G250 | int. Temperature perturbation at 250 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km center on input lat and lon (not always the model/analysis vortex position) (deg C*10) |
| V000 | int. Tangential wind azimuthally averaged at r=500 km from (TLAT,TLON) If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\)) |
| V850 | int. 850 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON) If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\)) |
| V500 | int. 500 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON) If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\)) |
| V300 | int. 300 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON) If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\)) |
| TGRD | int. Magnitude of the temperature gradient between 850 and 700 hPa averaged from 0 to 500 km estimated from the geostrophic thermal wind (\(degC/m*10^7\)) |
| TADV | int. The temperature advection between 850 and 700 hPa averaged from 0 to 500 km from the geostrophic thermal wind (\(degC/sec*10^6\)) |
| PENC | int. Azimuthally averaged surface pressure at outer edge of vortex \(( (hPa-1000)*10)\) |
| SHDC | int. Shear magnitude (kts*10) vs time (200-800 km) with vortex removed and averaged from 0-500 km relative to 850 hPa vortex center |
| SDDC | int. Heading in degrees of above shear vector where westerly shear is valued at 90 degrees |
| SHGC | int. Generalized 850–200 hPa shear magnitude (kts*10) (takes into account all levels) with vortex removed and averaged from 0-500 km relative to 850 hPa vortex center |
| DIVC | int. Divergence (\(sec^{-1}*10^7\)) for r = 0-1000 km centered at 850 hPa vortex location |
| T150 | int. 150 hPa temperature (deg C*10) versus time 200 to 800 km |
| T200 | int. 200 hPa temperature (deg C*10) versus time 200 to 800 km |
| T250 | int. 250 hPa temperature (deg C*10) versus time 200 to 800 km |
| SHRD | int. 850-200 hPa shear magnitude (kts*10) vs time 200-800 km |
| SHRS | int. 850-500 hPa shear magnitude (kts*10) |
| SHTS | int. Heading of the above shear vector (deg) |
| SHRG | int. Generalized 850-200 hPa shear magnitude (kts*10) (takes into account all levels) |
| PENV | int. 200 to 800 km average surface pressure \(((hPa-1000)*10)\) |
| VMPI | int. Maximum potential intensity from Kerry Emanuel equation (kts) |
| VVAV | int. Average (0 to 15 km) vertical velocity (\(m/s *100\)) of a parcel lifted from the surface where entrainment, the ice phase and the condensate weight are accounted for. Source note: Moisture and temperature biases between the operational and reanalysis files make this variable inconsistent in the 2001-2007 samples, compared to 2000 and before |
| VMFX | int. VVAV with a density weighted vertical average |
| VVAC | int. VVAV with soundings from 0-500 km with GFS vortex removed |
| HE07 | undefined |
| HE05 | undefined |
| IRXX | int. Non-satellite GOES model predictors used to generate IR00 |
| RD20 | int. Ocean depth of the 20 deg C isotherm (m), from satellite altimetry data |
| RD26 | int. Ocean depth of the 26 deg C isotherm (m) from satellite altimetry data |
| RHCN | int. Ocean heat content (kJ/cm^2) from satellite altimetry data |
| Match | logical. Crosscheck for a given storm, where hours 0 to 114 of the current case should be equivalent to hours 6 to 120 of the previous case for time dependent predictors |
| IR00_AVG_200BT | int. Average GOES 4 satellite brightness temp r=0-200 km (deg C *10) |
| IR00_STD_200BT | int. Standard deviation of GOES 4 satellite brightness temp r=0-200 km (deg C *10) |
| IR00_AVG_300BT | int. Average GOES 4 satellite brightness temp r=100-300 km (deg C *10) |
| IR00_STD_300BT | int. Standard deviation GOES 4 satellite brightness temp r=100-300 km (deg C *10) |
| IR00_PCT_AREA_10BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -10 C |
| IR00_PCT_AREA_20BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -20 C |
| IR00_PCT_AREA_30BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -30 C |
| IR00_PCT_AREA_40BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -40 C |
| IR00_PCT_AREA_50BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -50 C |
| IR00_PCT_AREA_60BT | int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -60 C |
| IR00_MAX_BT | int. Maximum brightness temp r = 0-30 km (deg C *10) |
| IR00_AVG_30BT | int. Average brightness temp r = 0-30 km (deg C *10) |
| IR00_RADIUS_MAXBT | int. Radius of maximum brightness temp (km) |
| IR00_MIN_20BT | int. Minimum brightness temp r = 20-120 km (deg C *10) |
| IR00_AVG_20BT | int. Average brightness temp r = 20-120 km (deg C *10) |
| IR00_RADIUS_MINBT | int. Radius of minimum brightness temp (km) |
| IR3* | ints. Same as the IR00 vars three hours before initial case time |
| JDATE | int. The absolute value of the Julian date minus the peak date of the season according to DeMaria and Kaplan (1994) |
| POT | int. The intensification potential, that is VMPI - VMAX (kts) |
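Several attributes in the table are stored in encoded or scaled form. The following sketch shows how the ID, LAT, LON, and POT definitions translate into code. The helper names are hypothetical, and returning longitude as east-positive degrees is a convention chosen here, not something the source specifies.

```python
from typing import Dict, Optional, Tuple, Union


def decode_storm_id(storm_id: str) -> Dict[str, Union[str, int]]:
    """Split an ID such as 'AL142012' into basin, sequence number, and year."""
    return {
        "basin": storm_id[:2],         # "AL" = Atlantic basin
        "number": int(storm_id[2:4]),  # sequence number within the year
        "year": int(storm_id[4:8]),
    }


def to_degrees(lat10: int, lon10_west: int) -> Tuple[float, float]:
    """LAT is stored as 10*degrees North and LON as 10*degrees West."""
    latitude = lat10 / 10.0
    longitude_east = -lon10_west / 10.0  # flip sign: west-positive to east-positive
    return latitude, longitude_east


def intensification_potential(vmpi: Optional[int],
                              vmax: Optional[int]) -> Optional[int]:
    """POT = VMPI - VMAX (kts); missing inputs give a missing result."""
    if vmpi is None or vmax is None:
        return None
    return vmpi - vmax
```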
Without any further cleaning, the data consist of 10705 rows of real-time instances and 135 columns of attributes for each instance. There are 460 unique TCs ranging from 1982-06-02 to 2014-10-28. The storm paths are again limited to the Atlantic Basin, as seen below.
The crosscheck for time point inconsistencies finds a total of 38 inconsistencies. As documented in Jankulak (2012), 18 of these inconsistencies belong to the REFC attribute. Another 10 are due to the TYPE attribute. The remaining 10 inconsistencies belong to 9 different storms, where the storm is missing at least one day of data. For example, Hurricane Nadine, the longest storm in this set of records, is missing cases between 120921 and 120923. We can see the missing data in the lat/long path below.
Next, we check that the storm history variables beginning with HIST are in fact a cumulative history. For the first TC, the HIST variables look like:
The same view for all storms is a little messier:
Checking numerically, row by row within each storm, that each history variable differs from its previous observation by either 0 or 1, we find a total of 16 storms with history issues, including the 9 storms with time point inconsistencies. The remaining 7 storm IDs,
## [1] "AL042013" "AL052011" "AL072012" "AL091997" "AL092014" "AL132005"
## [7] "AL282005"
show history inconsistencies that, on further investigation, result from date inconsistencies. This data set only includes cases that qualify as tropical or subtropical storms. If a storm weakens, it is no longer classified into either of these two categories and its data are no longer available. If the storm regains wind speed, the tracking resumes, which explains what happened to NADINE and to the storms with inconsistent histories. Note that there are also inconsistencies in the case-dependent TYPE and REFC attributes. With these variables excluded from the checks and the missing dates accounted for, all inconsistencies are resolved.
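The row-wise check on the HIST variables can be sketched as below. The function names and the input layout (a dict mapping storm ID to a time-ordered list of row dicts) are assumptions for illustration. Pairs involving a missing value are skipped here, which is also an assumption.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[int]]


def history_is_cumulative(rows: List[Row], hist_cols: Sequence[str]) -> bool:
    """True if every HIST column changes by 0 or 1 between consecutive rows."""
    for prev, curr in zip(rows, rows[1:]):
        for col in hist_cols:
            a, b = prev.get(col), curr.get(col)
            if a is None or b is None:
                continue  # assumption: missing values are not counted as issues
            if b - a not in (0, 1):
                return False
    return True


def storms_with_history_issues(by_storm: Dict[str, List[Row]],
                               hist_cols: Sequence[str]) -> List[str]:
    """IDs of storms whose HIST columns are not a clean cumulative count."""
    return sorted(storm_id for storm_id, rows in by_storm.items()
                  if not history_is_cumulative(rows, hist_cols))
```

A jump of more than 1 between consecutive rows is what a gap in the dates would produce, since the count keeps accumulating while no cases are recorded.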
Each storm instance has a binary label, 1 indicating RI (an increase in wind speed greater than 30 kts within 24 hours) and 0 indicating nonRI, in addition to a VMAX_Diff column that measures the future 24-hour change in intensity relative to the current time point. An NA in the RI and VMAX_Diff columns signifies that the data are not available for the time points needed to calculate these attributes (storms like NADINE). The distributions of maximum intensities per time interval for the nonRI and RI storms are shown below. Keep in mind that NAs are likely equivalent to nonRI events, as the storm has weakened.
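The labelling rule can be illustrated with a short sketch. The function name `label_ri` and the input layout are hypothetical. The strict comparison follows the wording "greater than 30 kts" used here; whether an increase of exactly 30 kts should count as RI is an assumption that may need adjusting.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Obs = Tuple[datetime, Optional[int]]  # (timestamp, VMAX in kts)


def label_ri(observations: List[Obs], threshold: int = 30,
             horizon_hours: int = 24
             ) -> List[Tuple[datetime, Optional[int], Optional[int]]]:
    """Return (timestamp, VMAX_Diff, RI) for each observation of one storm."""
    vmax_at = dict(observations)
    labelled = []
    for time, vmax in observations:
        future = vmax_at.get(time + timedelta(hours=horizon_hours))
        if vmax is None or future is None:
            # No observation 24 hours ahead: both attributes stay missing.
            labelled.append((time, None, None))
        else:
            diff = future - vmax
            labelled.append((time, diff, int(diff > threshold)))  # strict, per the text
    return labelled
```

Looking up the future value by timestamp, instead of shifting by four rows, keeps the label missing when a storm has a gap in its cases.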
In terms of class imbalance, there are 540 RI instances compared to 10068 nonRI instances. The range of VMAX_Diff shown below is reported as NA to NA, which suggests that missing values were not excluded when the range was computed.
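The degree of imbalance follows directly from the two counts: 540 of the 10608 labelled instances, or about 5.1%, are RI. The sketch below computes that proportion and shows a range calculation that skips missing values. Both helper names are hypothetical.

```python
from typing import List, Optional, Tuple


def class_balance(n_ri: int, n_non_ri: int) -> Tuple[float, float]:
    """Return (fraction of labelled instances that are RI, nonRI-to-RI ratio)."""
    total = n_ri + n_non_ri
    return n_ri / total, n_non_ri / n_ri


def na_safe_range(values: List[Optional[int]]
                  ) -> Tuple[Optional[int], Optional[int]]:
    """Minimum and maximum over the non-missing values only."""
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    return min(present), max(present)


fraction, ratio = class_balance(540, 10068)  # counts quoted in the text
```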
The longest storm in the data is AL142012 lasting a total of 558 hours. The shortest storm is AL052010 lasting a total of 6 hours. The distribution of storm duration for all storms, colored by the number of RI events in each storm, looks like:
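A storm's duration can be computed from its TIMESTAMP values. This sketch assumes that duration means the time between the first and last recorded case, which the text does not state explicitly, and that timestamps are strings in the yymmdd HH format from the attribute table. The function names are hypothetical.

```python
from datetime import datetime
from typing import List


def parse_timestamp(text: str) -> datetime:
    """Parse a 'yymmdd HH' timestamp such as '120921 06'."""
    return datetime.strptime(text, "%y%m%d %H")


def storm_duration_hours(timestamps: List[str]) -> float:
    """Hours between the earliest and latest case of one storm."""
    times = [parse_timestamp(t) for t in timestamps]
    return (max(times) - min(times)).total_seconds() / 3600
```

Under this definition a storm with only two cases, recorded 6 hours apart, has a duration of 6 hours.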
For a time series view, here is a snapshot of VMAX versus TIME:
Similarly, the 24-hour change in maximum wind speed varies widely for the storms that contain RI events.
Turning to storms where data are missing because of dissipation, there are exactly 25 dissipating storms out of 460 total storms. The dissipating storms are plotted below.
To summarize the amount of missing data in each of the attributes, we first subset to the time points with and without GOES satellite information available. As of March 2009, the satellite data were backfilled to 1983, whereas they were previously available only from 1995. A summary of the 5 storms up to 1983:
And a summary of the 306 storms post 1983:
Drilling down, the largest amount of missing data comes from the satellites, with 150 of the 306 post-1983 storms missing IR information.
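A missing-data summary of this kind can be sketched as follows. The helper names and the input layout (row dicts with None for NA) are assumptions, and so is the rule that a storm counts as missing an attribute when the value is absent at one or more of its time points.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[int]]


def missing_counts(rows: List[Row], columns: Sequence[str]) -> Dict[str, int]:
    """Number of missing (None) entries per column across all rows."""
    return {col: sum(1 for row in rows if row.get(col) is None)
            for col in columns}


def storms_missing(by_storm: Dict[str, List[Row]], column: str) -> List[str]:
    """IDs of storms where `column` is missing at one or more time points."""
    return sorted(storm_id for storm_id, rows in by_storm.items()
                  if any(row.get(column) is None for row in rows))
```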
DeMaria, Mark, and John Kaplan. 1994. “A Statistical Intensity Prediction Scheme (SHIPS) for the Atlantic Basin.” Weather and Forecasting 9 (June): 209–20.
Jankulak, Michael L. 2012. “Prediction of Rapid Intensity Changes in Tropical Cyclones Using Associative Classification.” PhD thesis, University of Miami.
Kaplan, John, and Mark DeMaria. 2003. “Large-Scale Characteristics of Rapidly Intensifying Tropical Cyclones in the North Atlantic Basin.” Weather and Forecasting 18 (6): 1093–1108.
“NHC Track and Intensity Models.” 2009.